from IPython.display import Image
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
img = mpimg.imread('netflix1.png')
plt.figure(figsize=(14, 10))
plt.imshow(img)
plt.show()
img = mpimg.imread('netflix2.png')
plt.figure(figsize=(14, 10))
plt.imshow(img)
plt.show()
img = mpimg.imread('netflix3.png')
plt.figure(figsize=(14, 10))
plt.imshow(img)
plt.show()
Netflix is one of the most popular streaming services in the world. It offers a wide range of movies and TV shows to its subscribers and is known for its foreign-language, genre-specific, and binge-worthy content. It provides the audience with quality, original productions, which is why it enjoys such success in the market. Like every large company, Netflix relies on a huge amount of data (Big Data). As a major streaming service, it collects data about its subscribers' actions: what they watch the most, when they watch, and for how long. This data also powers its recommendation system. By analyzing subscribers' viewing history and behavior, Netflix offers content that each subscriber is most likely to be interested in. Hence, the audience stays engaged with the platform, which benefits the company itself.
Since some of our group members are Netflix subscribers, we got interested in analyzing some patterns in its TV shows and movies. We chose two datasets containing different types of information about Netflix, such as the names of the movies and TV shows, the production year, the producers, etc. Having this amount of data gave us the opportunity to analyze some patterns in the content and provide visualizations demonstrating them more clearly. The purpose of this paper is to clean, analyze, and visualize the data we have, explaining our steps in detail. The language used for all the processing is Python.
Datasets
As mentioned above, we have two datasets, netflix_titles.csv and imdb_top_1000.csv. Both were obtained from Kaggle.com, one of the largest data science communities providing reliable and useful resources. netflix_titles.csv contains unlabelled text data on around 9000 Netflix shows and movies along with full details such as cast, release year, rating, and description. imdb_top_1000.csv is an IMDB dataset of the top 1000 movies and TV shows. In addition to the datasets, we used a GeoJSON file in our project called countries.geojson.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from wordcloud import WordCloud, STOPWORDS
import itertools
from textblob import TextBlob
Below are all the libraries we imported.
1) NumPy is a library for numerical computing in Python, providing tools for working with arrays and matrices.
2) Pandas is a library for data manipulation and analysis.
3) Seaborn is a library for statistical data visualization, offering a high-level interface for creating statistical graphics.
4) Matplotlib is a library for creating static, animated, and interactive visualizations in Python, with tools for many types of plots.
5) WordCloud is a library for creating word clouds: visual representations of text data in which the size of each word is proportional to its frequency in the text.
6) TextBlob is a Python library for processing textual data. It is built on top of the Natural Language Toolkit (NLTK) and provides a simple API for common natural language processing (NLP) tasks such as sentiment analysis, part-of-speech tagging, and noun phrase extraction.
7) Finally, itertools is a standard-library module for working with iterators, which are objects that can be looped over; it provides tools for creating, combining, and manipulating iterators.
df = pd.read_csv("./netflix_titles.csv")
df2 = pd.read_csv("./imdb_top_1000.csv")
pd.set_option('display.max_columns', None)
This is the first cell of our Python code. The first two lines create two pandas DataFrames, df and df2, by reading the CSV files; this allows us to work with the data inside the datasets. The third line tells pandas to display all columns of the DataFrames, because without setting the option to None, pandas would limit the number of displayed columns by default.
df.isnull().sum()
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
isnull().sum() is a pandas method chain used to count the missing values (NaN or None) in each column of the DataFrame df. Note that zeros and empty strings are not treated as missing.
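A minimal illustration on a toy frame (hypothetical data, not the Netflix set):

```python
import numpy as np
import pandas as pd

# A tiny frame with deliberate gaps; None and np.nan both count as missing,
# while 0 and "" do not.
toy = pd.DataFrame({
    "director": ["Kirsten Johnson", None, "Julien Leclercq"],
    "rating": ["PG-13", "TV-MA", np.nan],
    "release_year": [2020, 2021, 2021],
})

missing = toy.isnull().sum()
print(missing)
# director and rating each have one missing value; release_year has none
```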
df["country"].fillna("MISSING", inplace=True)
df["duration"].fillna("0 min", inplace=True)
df["director"].fillna("Unknown", inplace=True)
df["cast"].fillna("Unknown", inplace=True)
df["date_added"].fillna("Unknown", inplace=True)
df["rating"].fillna("Unknown", inplace=True)
df.head(n=3)
| show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | Unknown | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | Unknown | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | MISSING | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
Each fillna call above replaces the missing values in one column with a placeholder string: "MISSING" for "country", "0 min" for "duration", and "Unknown" for "director", "cast", "date_added", and "rating". In every case the replacement is a string, and inplace=True tells pandas to write the result back into the column instead of returning a modified copy. At the end we get a DataFrame with no missing or NaN values.
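Newer pandas versions warn that inplace=True on a selected column relies on chained assignment. A warning-free alternative is a single dict-based fillna, where the dict maps column names to fill values; a sketch on a toy frame (hypothetical data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": [None, "India"],
    "duration": ["90 min", None],
    "director": [np.nan, "Mozez Singh"],
})

# One call fills several columns at once; unlisted columns are left alone.
toy = toy.fillna({"country": "MISSING", "duration": "0 min", "director": "Unknown"})
print(toy.isnull().sum().sum())  # 0: no missing values remain
```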
df.describe()
| release_year | |
|---|---|
| count | 8807.000000 |
| mean | 2014.180198 |
| std | 8.819312 |
| min | 1925.000000 |
| 25% | 2013.000000 |
| 50% | 2017.000000 |
| 75% | 2019.000000 |
| max | 2021.000000 |
describe() is a method which returns a summary of the central tendency, dispersion, and shape of the distribution of the numeric columns of our DataFrame. In the output we see "count", "mean", "std", "min", "25%", "50%", "75%", and "max". These describe the columns of our DataFrame, already cleaned of missing values: count is the number of non-null values, mean is the average, std is the standard deviation, min and max are the minimum and maximum values, and 25%, 50%, and 75% are the 25th, 50th (median), and 75th percentiles of each column.
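A toy illustration of describe() on a single numeric column (hypothetical years):

```python
import pandas as pd

years = pd.DataFrame({"release_year": [2013, 2017, 2019, 2021, 1925]})
summary = years["release_year"].describe()
print(summary)
# count is 5, min is 1925, max is 2021, and 50% (the median) is 2017
```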
df.shape
(8807, 12)
df.shape returns a tuple with the number of rows and the number of columns in the DataFrame. In our case the number of rows is 8807 and the number of columns is 12.
df.columns
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
'release_year', 'rating', 'duration', 'listed_in', 'description'],
dtype='object')
In pandas, df.columns is an attribute that returns the names of the columns. Above we can see the column names of our dataset, along with the dtype of the resulting Index, which is "object".
df.count()
show_id         8807
type            8807
title           8807
director        8807
cast            8807
country         8807
date_added      8807
release_year    8807
rating          8807
duration        8807
listed_in       8807
description     8807
dtype: int64
count() method returns the number of non-null values in each column of a DataFrame. This makes it easy to verify that no missing values remain: every column now counts all 8807 rows.
df.nunique()
show_id         8807
type               2
title           8807
director        4529
cast            7693
country          749
date_added      1768
release_year      74
rating            18
duration         221
listed_in        514
description     8775
dtype: int64
nunique() method returns the number of unique values in each column of a Dataframe.
print(f" dtype - show_id: {df.show_id.dtype}")
print(f" dtype - type: {df.type.dtype}")
print(f" dtype - title: {df.title.dtype}")
print(f" dtype - director: {df.director.dtype}")
print(f" dtype - cast: {df.cast.dtype}")
print(f" dtype - country: {df.country.dtype}")
print(f" dtype - date_added: {df.date_added.dtype}")
print(f" dtype - release_year: {df.release_year.dtype}")
print(f" dtype - rating: {df.rating.dtype}")
print(f" dtype - duration: {df.duration.dtype}")
print(f" dtype - listed_in: {df.listed_in.dtype}")
print(f" dtype - description: {df.description.dtype}")
dtype - show_id: object
dtype - type: object
dtype - title: object
dtype - director: object
dtype - cast: object
dtype - country: object
dtype - date_added: object
dtype - release_year: int64
dtype - rating: object
dtype - duration: object
dtype - listed_in: object
dtype - description: object
Instead of explaining every line, only the first will be explained, since the others do the same for different columns.
print(f" dtype - show_id: {df.show_id.dtype}") prints the data type of the column; the f-string prefixes the value with a label containing the column name, so the meaning of each line is clear in the output.
From the output we can see that we have 11 columns of type object and 1 column (release_year) of type int64.
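The twelve print calls can also be collapsed into a single df.dtypes lookup, which returns a Series mapping each column name to its dtype; a sketch on a toy frame (hypothetical data):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Zodiac", "Zubaan"],
    "release_year": [2007, 2015],
})
print(toy.dtypes)
# title is object and release_year is int64, matching the pattern seen above
```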
df.dropna(axis="index", how="all")
| show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | Unknown | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | Unknown | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | MISSING | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | Unknown | Unknown | MISSING | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | Unknown | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8802 | s8803 | Movie | Zodiac | David Fincher | Mark Ruffalo, Jake Gyllenhaal, Robert Downey J... | United States | November 20, 2019 | 2007 | R | 158 min | Cult Movies, Dramas, Thrillers | A political cartoonist, a crime reporter and a... |
| 8803 | s8804 | TV Show | Zombie Dumb | Unknown | Unknown | MISSING | July 1, 2019 | 2018 | TV-Y7 | 2 Seasons | Kids' TV, Korean TV Shows, TV Comedies | While living alone in a spooky town, a young g... |
| 8804 | s8805 | Movie | Zombieland | Ruben Fleischer | Jesse Eisenberg, Woody Harrelson, Emma Stone, ... | United States | November 1, 2019 | 2009 | R | 88 min | Comedies, Horror Movies | Looking to survive in a world taken over by zo... |
| 8805 | s8806 | Movie | Zoom | Peter Hewitt | Tim Allen, Courteney Cox, Chevy Chase, Kate Ma... | United States | January 11, 2020 | 2006 | PG | 88 min | Children & Family Movies, Comedies | Dragged from civilian life, a former superhero... |
| 8806 | s8807 | Movie | Zubaan | Mozez Singh | Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan... | India | March 2, 2019 | 2015 | TV-14 | 111 min | Dramas, International Movies, Music & Musicals | A scrappy but poor boy worms his way into a ty... |
8807 rows × 12 columns
dropna() method is used to remove missing values from a DataFrame. The axis parameter specifies whether to remove rows or columns, and the how parameter sets the condition: with how="all", a row is dropped only if every value in it is missing. Since we already filled all the missing values above, this call removes nothing, and note that its result is not assigned back to df.
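A toy illustration of the how parameter (hypothetical data):

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 is partially missing, row 2 is entirely missing.
toy = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [2.0, 5.0, np.nan]})

all_missing_dropped = toy.dropna(axis="index", how="all")  # drops only the all-NaN row
any_missing_dropped = toy.dropna(axis="index", how="any")  # drops every row with any NaN
print(len(all_missing_dropped), len(any_missing_dropped))
```

Note that dropna returns a new DataFrame; the original toy frame keeps all three rows.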
# df["duration"].unique()
unique() method is used to get an array of the unique values in a DataFrame column. In our case, unique() would return an array of the unique values in the "duration" column (the call itself is commented out above).
l_num_dur=list() #creates a new list
l_seas_min=list() #creates another new list
for i in df["duration"]: #loops over the duration column
    num_dur=int(i.split()[0]) #gets the integer value
    w_dur=i.split()[1] #gets the word Seasons/min
    l_num_dur.append(num_dur) #adds the numerical value of the duration
    l_seas_min.append(w_dur) #adds either Seasons or min
df["Number duration"]=l_num_dur #creates a new column from the list
df["Season/min"]=l_seas_min #creates a new column from the list
df.head(n=3) #prints the first three rows of the dataframe
| show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | Number duration | Season/min | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | Unknown | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... | 90 | min |
| 1 | s2 | TV Show | Blood & Water | Unknown | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... | 2 | Seasons |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | MISSING | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... | 1 | Season |
The code above splits the duration column into two parts: an integer and the word Seasons or min. Each line is explained in the code cell.
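The loop works, but the same transformation can be done in one vectorized step with pandas' str.split; a sketch on toy durations (hypothetical data):

```python
import pandas as pd

toy = pd.DataFrame({"duration": ["90 min", "2 Seasons", "1 Season"]})

parts = toy["duration"].str.split(expand=True)  # column 0: the number, column 1: the unit
toy["Number duration"] = parts[0].astype(int)
toy["Season/min"] = parts[1]
print(toy)
```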
df["rating"].unique()
array(['PG-13', 'TV-MA', 'PG', 'TV-14', 'TV-PG', 'TV-Y', 'TV-Y7', 'R',
'TV-G', 'G', 'NC-17', '74 min', '84 min', '66 min', 'NR',
'Unknown', 'TV-Y7-FV', 'UR'], dtype=object)
In this case unique() will return an array of unique values in the "rating" column.
Here, as we can see, we have a problem: three values from the "duration" column have been placed in the "rating" column. What we will do is simply get their indices and then assign them their true values. Those values in the "rating" column will be marked as "Unknown", while in the "duration" column they will get their true values.
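Alternatively, the misplaced values could be repaired in one pass with a boolean mask instead of hard-coding each string; a sketch on toy rows (hypothetical data):

```python
import pandas as pd

toy = pd.DataFrame({"rating": ["PG-13", "74 min", "84 min"],
                    "duration": ["90 min", "0 min", "0 min"]})

# Rows where a duration string leaked into the rating column
mask = toy["rating"].str.contains("min", na=False)
toy.loc[mask, "duration"] = toy.loc[mask, "rating"]  # restore the true duration
toy.loc[mask, "rating"] = "Unknown"                  # mark the rating as unknown
print(toy)
```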
print(df.loc[df["rating"]=="74 min", "duration"])
print(df.loc[df["rating"]=="84 min", "duration"])
print(df.loc[df["rating"]=="66 min", "duration"])
#There were 3 Nan values in the duration column, and the original values were put in ratings column
5541    0 min
Name: duration, dtype: object
5794    0 min
Name: duration, dtype: object
5813    0 min
Name: duration, dtype: object
df.loc[df["rating"]=="74 min", "duration"]="74 min"
df.loc[df["rating"]=="84 min", "duration"]="84 min"
df.loc[df["rating"]=="66 min", "duration"]="66 min"
df.loc[df["rating"]=="74 min", "rating"]="Unknown"
df.loc[df["rating"]=="84 min", "rating"]="Unknown"
df.loc[df["rating"]=="66 min", "rating"]="Unknown"
print(df["type"].unique()) #here we see that there are only two types: Movie and TV show
df_movies = df[df["type"] == "Movie"]
df_series = df[df["type"] == "TV Show"]
['Movie' 'TV Show']
Two new DataFrames, one for movies and one for TV shows, created from the original df, enable us to analyze movies and TV series independently.
n_of_movies_and_tv_series = df["title"].nunique()
n_of_movies = df_movies["title"].count()
n_of_shows = df_series["title"].count()
print(f"There are {n_of_movies} movies and {n_of_shows} TV shows in our dataset")
#that means that netflix produced more movies than TV shows
n_of_movies_and_tv_series
There are 6131 movies and 2676 TV shows in our dataset
8807
Above, some basic analysis is performed by calculating the number of movies and TV shows in the dataset.
percentage_movies = (n_of_movies / n_of_movies_and_tv_series) * 100
percentage_tvshows = (n_of_shows / n_of_movies_and_tv_series) * 100
labels = ['Movies', 'TV Shows']
values = [percentage_movies, percentage_tvshows]
colors_pie = ['#E50914', '#8C8C8C', '#221F1F']
plt.figure(figsize=(8,8))
plt.pie(values, colors = colors_pie, autopct='%1.1f%%', textprops={'fontsize': 14})
plt.title('Movies and TV Shows')
plt.legend(loc='upper right', labels=labels)
plt.show()
By dividing the number of movies by the total number of movies and TV shows, and likewise for TV shows, the percentage of each is calculated. The pie chart is then drawn with matplotlib, using the labels "Movies" and "TV Shows" and matching colors. The pie chart makes it easier to see the distribution of movies and TV series in the dataset.
count_dict = {} #creates an empty dictionary
for year in df["release_year"]: #iterates over the release_year column
    if year in count_dict: #checks whether that year already exists in the dictionary as a key
        count_dict[year] += 1 #if yes, adds 1 to its value
    else:
        count_dict[year] = 1 #if no, adds it to the dictionary as a key
max_n = max(count_dict.values()) #finds the maximum number of movies
min_n = min(count_dict.values()) #finds the minimum number of movies
list_of_min = []
list_of_max = []
for i in count_dict.keys():
    if count_dict[i] == max_n:
        list_of_max.append(i) #collects the years whose count equals the maximum number of movies
    if count_dict[i] == min_n:
        list_of_min.append(i) #the same with the minimum number of movies
print(f"{list_of_max} is/are the year(s) when the highest number of film/series available on Netflix were produced.")
print(f"{list_of_min} is/are the year(s) when the lowest number of film/series available on Netflix were produced.")
[2018] is/are the year(s) when the highest number of film/series available on Netflix were produced.
[1961, 1959, 1925, 1966, 1947] is/are the year(s) when the lowest number of film/series available on Netflix were produced.
This algorithm examines the dataset to identify the years where the streaming service produced the most and the least TV shows and movies.
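The hand-rolled dictionary above is instructive, but pandas can produce the same tallies directly with value_counts(); a sketch on toy years (hypothetical values):

```python
import pandas as pd

years = pd.Series([2018, 2018, 2018, 2017, 1925])
counts = years.value_counts()  # year -> number of titles

max_years = counts[counts == counts.max()].index.tolist()
min_years = counts[counts == counts.min()].index.tolist()
print(max_years, sorted(min_years))
```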
keys = list(count_dict.keys()) #creates a list from the dictionary keys
keys.sort() #sorts the list to avoid a mess in the graph
value_counts_2 = {} #creates an empty dictionary
for i in keys:
    value_counts_2[i] = count_dict[i]
keys_2 = list(value_counts_2.keys())
values_2 = list(value_counts_2.values())
plt.figure(figsize=(12,10))
plt.plot(keys_2, values_2, color="red")
plt.xlabel("Years")
plt.xlim(1997,2023) #since Netflix was founded only in 1997
plt.ylabel("Number of movies")
plt.title("Number of produced movies over years")
plt.show()
This code sorts the years and plots the number of titles produced per year as a line chart, restricted to 1997 onward, since Netflix was founded in 1997.
rating_list = list(df["rating"].unique()) # creates a list of all unique movie ratings in the dataset
rating_list.remove("Unknown") # removes the "Unknown" rating from the list
rating_count_dict = {} # creates an empty dictionary to store the counts of each rating
for rating in df["rating"]: # iterates over each title in the dataset
    # if the rating is already in the dictionary, increment its count
    if rating in rating_count_dict:
        rating_count_dict[rating] += 1
    # if the rating is not "Unknown" and not already in the dictionary, add it with a count of 1
    elif rating != "Unknown":
        rating_count_dict[rating] = 1
keys_3 = rating_count_dict.keys() #gets the keys (ratings)
values_3 = rating_count_dict.values() #gets the values (counts)
plt.figure(figsize=(12,8)) # creates a figure with a size of 12 x 8 inches
plt.bar(keys_3, values_3, color="darkred", ec="black")
#creates a bar chart with the ratings on the x-axis, counts on the y-axis, dark red bars, and black edges
plt.xticks(rotation=45) #rotates the x-axis labels by 45 degrees for readability
plt.ylabel("Number of movies") #sets y-axis label
plt.xlabel("Rating") #sets x-axis label
plt.title("Motion Picture Association film rating system") #sets the title
plt.show() #displays the chart
The graph makes it easier to see how Netflix movies and TV series are rated. It shows that most of the Netflix selection is aimed at mature audiences, with TV-MA the most common rating and TV-14 following closely behind. The graph also shows a limited number of G-, PG-, and TV-G-rated films and programs that are appropriate for younger viewers. Overall, this chart offers useful information about the kinds of Netflix material on offer and the intended viewers for each rating group.
Let's create something interesting: a word cloud of the words used in movie titles. What will this give us? We will find out which words were used most often across the titles of more than 8000 movies and shows.
title_text = ' '.join(df['title'].dropna())
wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', colormap="Reds", width=800, height=400).generate(title_text)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Based on the titles of the films and TV series in the dataset, this code creates a word cloud. The WordCloud class from the wordcloud library is used for this purpose. Those words that appear most frequently in the names of the films and TV series in the dataset are represented visually in the resulting word cloud.
director_counts = df[df["director"]!="Unknown"]['director'].value_counts()
popular_directors=director_counts[director_counts>10]
colormap=cm.ScalarMappable(cmap=cm.Reds) # create a colormap
colormap.set_clim(popular_directors.min(), popular_directors.max()) # set the limits for the colormap
plt.figure(figsize=(12,8)) # creates a figure with a size of 12 x 8 inches
popular_directors.plot(kind='bar', color=colormap.to_rgba(popular_directors), ec="black" ) # plot a bar chart with the colormap
plt.title('Number of Movies Directed by Each Director (directors with more than 10 movies)') # set the title
plt.xlabel('Director') # set the x-axis label
plt.ylabel('Number of Movies') # set the y-axis label
plt.xticks(rotation=45, fontsize = 12) # rotate the x-axis labels
plt.show() # display the plot
By only taking into account directors who have directed more than 10 films, this code creates a bar plot showing the number of films directed by each director (or group of directors, if they worked on the same project together). The value_counts() function counts the number of films per director, and the popular_directors variable keeps only those with more than 10 films. A colormap sets each bar's color according to the number of films its director has made. The plot is then drawn with the popular_directors.plot() method, with appropriate labels, a clear title, and rotated tick labels. It makes it easy to see which directors have produced the most Netflix films.
duration_of_movies=df[df["Season/min"]=="min"]["Number duration"]
plt.figure(figsize=(12,8))
plt.hist(duration_of_movies, bins=25, color="red", ec="black")
plt.title("Duration Distribution")
plt.xlabel("Movie Duration")
plt.ylabel("Number of movies")
plt.show()
We display the dataset's distribution of movie runtimes here. To extract the duration values, we first filter the DataFrame to only include the rows representing movies (not TV shows). Then, using the plt.hist() function with 25 bins, we plot a histogram of the durations. The figure shows that the majority of the dataset's films run between 70 and 120 minutes, peaking around 90 minutes. A very small number of movies are longer than 200 minutes, but they do exist. This plot gives a general idea of how long Netflix movies tend to be, which can be helpful for content producers who wish to make movies that appeal to the platform's subscribers.
duration_of_tvshows=df[df["Season/min"]=="Seasons"]["Number duration"]
plt.figure(figsize=(12,8))
plt.hist(duration_of_tvshows, bins=10, color="black", ec="red")
plt.title("Duration Distribution")
plt.xlabel("TV Show Duration")
plt.ylabel("Number of TV Shows")
plt.show()
Just like the previous one, we display the durations of TV shows. This histogram indicates that most Netflix TV programs have between 1 and 4 seasons, with 1-2 seasons being the most frequent. A significant percentage of TV series run for 5-10 seasons, while a much smaller proportion run for more than 10.
df[df["Season/min"]=="min"].groupby(df["release_year"])["Number duration"].agg(["mean", "max", "min"])
mean_durations=df[df["Season/min"]=="min"].groupby(df["release_year"])["Number duration"].mean()
plt.figure(figsize=(12,8))
plt.bar(mean_durations.index, mean_durations.values, color="white", edgecolor="red")
plt.title('Average movie duration for each year')
plt.xlabel('year')
plt.ylabel('average movie duration')
plt.show()
According to the plot, the average movie runtime has varied throughout the years, generally growing since the early 2000s. It should be emphasized that this analysis only takes films into account and ignores TV shows; as a result, the plot offers only a partial view of the overall trend in the duration of Netflix content.
Sentiment analysis is used to understand the emotion or attitude behind a piece of text. In our case, by applying this analysis to the "description" column, we can determine whether the content on Netflix is mainly positive or not.
For this, we import the TextBlob library and use its sentiment property, which has two components: polarity and subjectivity. We focus on polarity. It returns a number from -1 to 1: if it is >0 the content is positive, if <0 negative, and if it equals zero neutral.
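The polarity thresholding described above can be isolated in a small pure-Python helper (the TextBlob call itself is omitted here, so this runs without the library):

```python
def polarity_to_sentiment(polarity: float) -> str:
    """Map a TextBlob-style polarity score in [-1, 1] to a label."""
    if polarity < 0:
        return "Negative"
    if polarity > 0:
        return "Positive"
    return "Neutral"

print(polarity_to_sentiment(0.35), polarity_to_sentiment(-0.2), polarity_to_sentiment(0.0))
```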
for index,row in df.iterrows():
    d=row['description']
    d_blob=TextBlob(d)
    d_polarity=d_blob.sentiment.polarity
    if d_polarity<0:
        d_sentiment='Negative'
    elif d_polarity>0:
        d_sentiment='Positive'
    else:
        d_sentiment='Neutral'
    df.loc[index,'Description_Sentiment']=d_sentiment #writes the label into a new column for this row
d_sentiment_counts=df.groupby(["release_year", "Description_Sentiment"]).size().unstack()
d_sentiment_counts = d_sentiment_counts.loc[2000:]
colors = {'Negative': 'red', 'Neutral': 'black', 'Positive': 'grey'}
d_sentiment_counts.plot(kind='bar', stacked=True, figsize=(12,8), color=[colors[c] for c in d_sentiment_counts.columns]) #pandas' .plot creates its own figure, so the size is passed here
plt.xlabel('Year')
plt.ylabel('Number of movies')
plt.title('Sentiment Content of Netflix')
plt.show()
As we can see, over the years the movies on Netflix have become more positive overall, since the number of movies with positive descriptions has increased significantly.
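The groupby/unstack step that builds the stacked bars can be sketched on toy data (hypothetical labels; fill_value=0 replaces the NaN that unstack would otherwise produce for missing combinations):

```python
import pandas as pd

toy = pd.DataFrame({
    "release_year": [2019, 2019, 2020, 2020, 2020],
    "Description_Sentiment": ["Positive", "Negative", "Positive", "Positive", "Neutral"],
})

# Count titles per (year, sentiment) pair, then pivot sentiments into columns.
counts = toy.groupby(["release_year", "Description_Sentiment"]).size().unstack(fill_value=0)
print(counts)
```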
countries_list = [] #creates a new list for countries
for i in df["country"]: #iterates over the values in the "country" column
    if "," in i: #finds the rows which have several values for country, separated by commas
        a=str(i).split(sep = ",") #splits the string into a list of countries
        for b in a: #iterates over list a with the separated countries
            countries_list.append(b) #appends the countries to the initial list
    else:
        countries_list.append(i) #if there are no commas, only one country is mentioned, so we append it directly
new_countries_list =[x for x in countries_list if x != "MISSING" and x != ""] #removes all the "MISSING" and empty values from countries_list
new_countries_list_2 = [x.strip() for x in new_countries_list] #strip removes the unnecessary surrounding spaces
from collections import Counter
country_counts = Counter(new_countries_list_2) #creates a Counter object and assigns it to country_counts
country_counts_dict = dict(country_counts) #converts the Counter object country_counts to a regular dictionary country_counts_dict.
countries = country_counts_dict.keys() #creates a new variable called countries, which contains a list of all the unique country names in new_countries_list_2.
n_of_movies_countries = country_counts_dict.values() #creates a new variable called n_of_movies_countries, which contains a list of the number of movies produced in each country.
The Counter object counts the number of occurrences of each country in the list, and returns a dictionary-like object where the keys are the countries and the values are the counts.
The keys() method is used to extract the keys (i.e., the country names) from the dictionary country_counts_dict.
The values() method is used to extract the values (i.e., the counts) from the dictionary country_counts_dict.
The code above creates a dictionary, where the keys represent the countries and the values show how many movies were produced in each country
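A minimal illustration of Counter on a hypothetical country list:

```python
from collections import Counter

countries = ["United States", "India", "United States", "France", "India", "United States"]
country_counts = Counter(countries)

# most_common(n) returns the n most frequent entries as (key, count) pairs.
print(country_counts.most_common(2))
```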
import json
import folium
geo_json_data = json.load(open('countries.geojson')) #loads the countries.geojson file, which contains geographic data for all the countries in the world
df_2 = pd.DataFrame({'Country': list(countries), 'Movie Count': list(n_of_movies_countries)}) #creates a new pandas DataFrame called df_2
#Country column contains a list of all the unique country names in new_countries_list_2
#Movie Count column contains a list of the number of movies produced in each country.
map = folium.Map(location=[37.0902, -95.7129], zoom_start=2) #creates a new folium Map object, which will be used to display the choropleth map.
#"Location" sets the center of map to be United States
#"Zoom start" sets the zoom level of the map
folium.Choropleth(
geo_data= geo_json_data,
name='choropleth',
data=df_2,
columns=['Country', 'Movie Count'],
key_on='properties.ADMIN', #specifies the key in the geo_json_data that matches the country names in df_2
fill_color='YlGn',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Number of Movies', #sets the label
highlight=True
).add_to(map)
map